The One Page Model Checker 



Jason E. Holt (isrl@lunkwill.org)


February 1, 2008 






Abstract 



We show how standard IPC mechanisms can be used with the fork()
system call to perform explicit state model checking on all interleavings
of a multithreaded application. We specifically show how to check for
deadlock and race conditions in programs with two threads. Our techniques
are easy to apply to other languages, and require only the most
rudimentary parsing of the target language. Our fundamental system fits
in one page of C code.



1 Introduction and Related Work 

Debugging multithreaded applications is hard. Race conditions mean that failures
may be nondeterministic. Deadlock can be hard to trace, because it involves
behaviors from multiple concurrent threads. Tools to prove that a piece of code
has no such behaviors can help find such errors, and instill confidence that the
programs will work correctly when deployed.

Here we describe a method for measuring the behavior of multithreaded programs
through all possible execution interleavings. Our work is straightforward
and applicable to many different programming languages, although it also has
some significant, fundamental limitations.

Several other techniques have been proposed which relate to model checking
multithreaded applications. Visser et al. [3] built an optimized system which
implements the Java VM and can prove properties about multithreaded applications.
Mercer and Jones [1] built an explicit state model checker designed
around specific CPUs which verifies properties of compiled code.

In 1997, Savage et al. [2] introduced a tool which enforces a locking discipline
on resources to prevent nondeterministic behavior caused by OS scheduling. 
While not a model checker per se, this tool also aims to help programmers 
reduce the uncertainty associated with multithreaded applications. 



2 System Overview 

The principle behind our system is easy to understand. Given a program with
two threads, we wish to search for particular conditions like deadlock among all
possible thread interleavings. For example, if the first thread executes print(ab)
and the second executes print(12), then at least the following behaviors are possible:
the first thread could run to completion, followed by the second, producing
the output string "ab12", or the second could run to completion first, producing
"12ab". If the operating system chooses to switch between the threads while
they're running, the strings "a1b2", "a12b", "1ab2", or "1a2b" might also occur. Of
course, if print() isn't thread safe, the program might also exhibit other behaviors
or crash entirely.
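As a point of reference before any processes are involved, the interleavings of this example can be enumerated by an ordinary recursive search in a single process. The sketch below is illustrative only: it assumes each statement prints one character, and the names enumerate_interleavings, results and result_count are hypothetical, not taken from our implementation.

```c
#include <assert.h>
#include <string.h>

#define MAX_RESULTS 64   /* illustrative bound on stored interleavings */
#define MAX_LEN 16       /* illustrative bound on output length */

/* Collected outputs, in the order the search visits them. */
char results[MAX_RESULTS][MAX_LEN];
int result_count;

/* Extend the partial output `out` (length `len`) with the next character
 * of thread 0 (string a, position i) or thread 1 (string b, position j). */
static void interleave(const char *a, size_t i, const char *b, size_t j,
                       char *out, size_t len)
{
    if (a[i] == '\0' && b[j] == '\0') {      /* both threads finished */
        assert(result_count < MAX_RESULTS && len < MAX_LEN);
        out[len] = '\0';
        strcpy(results[result_count++], out);
        return;
    }
    if (a[i] != '\0') {                      /* thread 0 runs one statement */
        out[len] = a[i];
        interleave(a, i + 1, b, j, out, len + 1);
    }
    if (b[j] != '\0') {                      /* thread 1 runs one statement */
        out[len] = b[j];
        interleave(a, i, b, j + 1, out, len + 1);
    }
}

/* Enumerate every interleaving of a and b; returns how many were found. */
int enumerate_interleavings(const char *a, const char *b)
{
    char out[MAX_LEN];

    assert(strlen(a) + strlen(b) < MAX_LEN);
    result_count = 0;
    interleave(a, 0, b, 0, out, 0);
    return result_count;
}
```

Calling enumerate_interleavings("ab", "12") fills results with the six possible outputs, starting with "ab12" and ending with "12ab". The recursion relies on overwriting out[len] to undo a choice; the fork()-based approach described next gets that backtracking from the operating system instead.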

To test all possible interleavings, the model checker must have a way of trying 
different execution paths and must keep track of which paths have been explored. 
Our system's simplicity and compactness come from using the Unix-standard
fork() system call, which forks a process at the point of the function call into two
independent, identical processes. fork() can be called inside branching, looping
or subroutine constructs, and both the original and newly created child processes
will return to the following statement and continue as if nothing happened.
This is different from most thread implementations, which must be called with
a subroutine to execute in the new thread, after which the thread terminates.

fork() allows our model checker to be implemented much like a normal recursive
depth-first search. In such a search, the program's stack is used to implicitly
keep track of the current progress through the state space being searched. In
place of nested function calls, our implementation creates a child process at
each branching point to explore the next level of the search. This can happen
surprisingly quickly, because OS techniques like copy-on-write paging make
forking efficient; our Athlon64 3000+ running Debian GNU/Linux
can perform over 7,000 fork() operations per second.
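The idea of letting fork() stand in for recursion can be shown on a plain binary decision tree. This is a minimal sketch under stated assumptions, not our model checker: the names explore and count_paths are hypothetical, and the shared counter is allocated with an anonymous shared mmap() mapping, which is one of several ways to obtain shared memory.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Walk a binary decision tree of the given depth, depth first.  At each
 * level the process forks inside the loop: the child continues at once as
 * branch 0, while the parent waits for the child to exit and only then
 * continues as branch 1.  `leaves` points into shared memory. */
static void explore(int depth, int *leaves)
{
    for (int level = 0; level < depth; level++) {
        pid_t pid = fork();
        if (pid < 0)
            _exit(1);                       /* could not fork: give up */
        if (pid > 0) {                      /* parent: child goes first */
            int status;
            if (waitpid(pid, &status, 0) != pid ||
                !WIFEXITED(status) || WEXITSTATUS(status) != 0)
                _exit(1);
        }
        /* Both processes fall through to the next level from here. */
    }
    (*leaves)++;   /* one complete path; only one process runs at a time */
}

/* Returns the number of root-to-leaf paths explored, or -1 on error. */
int count_paths(int depth)
{
    int *leaves = mmap(NULL, sizeof *leaves, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (leaves == MAP_FAILED)
        return -1;
    *leaves = 0;

    int result = -1, status;
    pid_t root = fork();
    if (root == 0) {
        explore(depth, leaves);
        _exit(0);   /* every process created inside explore() ends here too */
    }
    if (root > 0 && waitpid(root, &status, 0) == root &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = *leaves;
    munmap(leaves, sizeof *leaves);
    return result;
}
```

With depth 3, count_paths reports 8 paths. Because each parent waits for its child before continuing, at most depth + 2 processes exist at any moment and the increments of the shared counter never overlap.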

Since our model checker operates on programs with two threads, we carefully
synchronize pairs of processes to implement the possible execution orders.
We call this technique "the buddy system", and a pair of cooperating threads
"buddies". Each pair of buddies maintains a data structure in shared memory
containing semaphores, the execution path and other elements required for IPC and
coordination.

And since fork() creates processes, not threads, we implement threadlike 
behavior using shared memory. Threads do not get separate copies of program 
variables as processes do, so we create a structure in a shared memory segment 
where buddy processes keep all application variables. 
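The difference between ordinary process variables and variables kept in a shared segment can be seen in a few lines. The following is a sketch under assumptions: struct app_vars is a hypothetical stand-in for the structure of application variables, and the segment is created with an anonymous shared mmap() mapping rather than whatever mechanism a particular implementation uses.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the structure holding application variables. */
struct app_vars {
    int counter;
};

/* Returns 1 if a child's write is visible to the parent through the shared
 * segment but not through an ordinary variable, 0 if not, -1 on error. */
int shared_write_is_visible(void)
{
    struct app_vars *shared = mmap(NULL, sizeof *shared,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;

    int private_copy = 0;
    shared->counter = 0;

    pid_t pid = fork();
    if (pid == 0) {
        shared->counter = 42;   /* lands in the segment both processes map */
        private_copy = 42;      /* changes only the child's own copy */
        _exit(0);
    }

    int ok = -1;
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        ok = (shared->counter == 42 && private_copy == 0);
    munmap(shared, sizeof *shared);
    return ok;
}
```

The parent sees the child's update to shared->counter but not to private_copy, which is why application variables must live in the shared structure if two processes are to behave like two threads.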

2.1 Example 

We will expand the earlier example into separate statements to show how our
technique works. Thread 0 executes print(a); print(b), while thread 1 executes
print(1); print(2). First, we instrument our code by placing calls to a function
hook() before each statement, then calling done() at the end of execution:

thread0() {
    hook(); print(a);
    hook(); print(b);
    done();
}

thread1() {
    hook(); print(1);
    hook(); print(2);
    done();
}



The code is then compiled and linked with our model checking code. When
it executes, fork() is called to create a separate process for each thread, each of
which executes into the first hook(). There are only two possible ways in which
the threads can execute their first statements; either thread0 or thread1 goes
first. Each "thread" (really a process) thus forks into two processes, resulting in
two parents and two children. The pair of parents become buddies, and the pair
of children become buddies. The parent processes each wait for their children
to terminate, much as a recursive function calls itself and waits for the recursive
call to finish. The child process for thread 1 blocks using a semaphore, waiting
for its buddy in thread 0 to execute a single statement. The buddy process
does this by returning from the call to hook(), which allows the first statement,
print(a), to execute. Then that process hits the next call to hook() and signals
to its buddy that it has finished executing a statement.

Now the process repeats; either thread0 can execute another statement, or
thread1 can execute its first statement. Again, each buddy forks, with the
parents waiting for the children to finish. The child process for thread 0 again
goes first, returning to execute print(b), then calling done(), which signals to
the buddy process that thread 0 has completed execution. With no remaining
alternatives, thread 1 now runs to completion, giving a resulting output of
"ab12".

Once the grandchildren of the original two threads have each terminated,
the children continue running. Since the grandchildren explored the case in
which thread 0 executed another statement, the children explore what happens
when thread 1 runs. Thread 1 returns from its hook and executes print(1),
then signals its buddy. Once again, the children have two alternatives, so they
fork another pair of grandchildren. The thread 0 grandchild executes its second
statement and terminates, allowing thread 1 to complete and producing the
string "a1b2". The children again execute a statement from thread 1, print(2),
after which thread 0 runs to completion, producing "a12b". Now the original
two threads can continue, executing print(1) from thread 1, and forking another
set of children, which fork grandchildren as before, producing the strings "1ab2",
"1a2b", and "12ab".

3 Code Instrumentation, Language Independence
and Statement Atomicity

In order for our system to work properly, application code must be properly 
instrumented, by making calls to hook() before each program statement. These 
statements are assumed to execute atomically, which does not generally happen 
in current systems, but which can be assured using a technique we describe later 
in this section. 

This instrumentation process is very simple, and works independently of 
language constructions like loops, function calls and branches. For example, we 
first implemented the example we gave in the last section as follows, essentially 
the same as we listed it before: 

if (child) {
    // thread 0
    hook(); printf("a");
    hook(); printf("b");
    done();
} else {
    // thread 1
    hook(); printf("1");
    hook(); printf("2");
    done();
}

But later, we generalized it to work for arbitrary strings using a separate 
function and a loop: 

void str(char *s) {
    int i;

    for (i = 0; s[i]; i++) {
        hook(); b->common.outstr[b->common.outidx++] = s[i];
    }
}

main() {
    if (child) {
        // thread 0
        str("ab");
        done();
    } else {
        // thread 1
        str("12");
        done();
    }
}

A naive, automated instrumentation tool might have added additional hook()
calls as follows, but that would have merely added overhead to the model checking
process:

void str(char *s) {
    int i;

    hook(); for (i = 0; s[i]; i++) {
        hook(); b->common.outstr[b->common.outidx++] = s[i];
    }
}

main() {
    if (child) {
        // thread 0
        hook(); str("ab");
        done();
    } else {
        // thread 1
        hook(); str("12");
        done();
    }
}

While modifying program source code before model checking is generally 
deprecated, we feel our technique has several interesting features. First, adding 
calls to hook() before each program statement is easy to do automatically for 
reasonably written source code, even without constructing a formal parser for 
the target language. The implementor must simply avoid placing calls where 
they would cause syntax errors in the program, such as in between function 
declarations. Redundant calls to hook() add overhead, but don't otherwise
break our system. This makes our system straightforward to implement in a 
variety of languages, whereas traditional systems require significant adaptation 
to target languages in order to properly model their behavior. 

Second, our instrumentation can be used to perform other tasks besides model
checking, by changing the behavior of hook(). For instance, hook() could be
modified to implement white box testing, in which test cases are constructed
which together must execute all code branches.

Third, our instrumentation can be left in place to guarantee the statement-level
atomicity assumed by our system. Generally speaking, modern CPUs
offer only machine instruction level atomicity: the OS may interrupt a process
between any two instructions and begin execution of a different process. Model
checkers like Estes [1] work on these machine instructions directly, but their
results can only be applied to that particular compilation of the application on a
particular CPU. This may prove to be the only way to prove useful thread safety
and liveness properties about unmodified code on a particular CPU, and would
tend to suggest that dealing with multithreaded applications in their original
high-level language form doesn't even make sense. On the other hand, if calls
to hook() are left in place in distributed code, the function can be modified to
essentially make each application statement into an individual critical section.
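Returning to the second feature above, reusing the instrumentation for white box testing, one possible shape for such a hook() is sketched here. Everything in it is an assumption made for illustration: the macro that passes __LINE__, the names hook_at, sites_covered and sign_label, and the fixed-size table. It also assumes at most one hook() per source line.

```c
#include <assert.h>

#define MAX_SITES 64                /* illustrative bound on hook() sites */

static int site_line[MAX_SITES];    /* source line of each site reached */
static int site_count;

/* Record that the hook() on `line` was reached at least once. */
void hook_at(int line)
{
    for (int i = 0; i < site_count; i++)
        if (site_line[i] == line)
            return;                 /* already counted */
    assert(site_count < MAX_SITES);
    site_line[site_count++] = line;
}

/* hook() keeps its call syntax; the macro supplies the call site. */
#define hook() hook_at(__LINE__)

/* Number of distinct hook() sites executed so far. */
int sites_covered(void)
{
    return site_count;
}

/* Instrumented example with three hook() sites, one per line. */
int sign_label(int x)
{
    hook(); if (x >= 0) {
        hook(); return 1;
    }
    hook(); return -1;
}
```

A test suite that only ever calls sign_label with non-negative arguments reaches two of the three sites; adding a negative argument reaches the third. Comparing sites_covered() against the known number of sites then tells the tester whether every instrumented statement has been executed.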

Admittedly, this adds a large amount of overhead to the code, since system
calls to raise and lower a semaphore must be made for each program statement.
But in modern high level scripting languages particularly, programs tend to
have fewer statements, with powerful built-in commands having relatively high
execution costs. Such languages may be particularly difficult to verify at the
machine code level, since they run via large, complex interpreters. As a first
approximation of the overhead our technique would add, we wrote a C program
which forks into two processes, each of which loops 1,000,000 times. In the
loop, each process grabs a semaphore, adds the current index to a variable,
then releases the semaphore. The program performed the 4,000,000 semaphore
operations in about 0.8 seconds on our Athlon64 3000+. We then ran a program
in Perl which loops 2,000,000 times in a single process, likewise adding up the
index values. It also took about 0.8 seconds, suggesting that efficient C-based
hooks in the Perl code to ensure statement-level atomicity might add only 50%
overhead to such a program, and possibly less for a program using fewer, more
costly operations than simple loops and additions.
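A program of the kind just described might look like the following. This is a reconstruction for illustration, not the program we timed: it assumes a System V semaphore set with a single semaphore and an anonymous shared mmap() mapping for the variable, the names locked_sum, worker, sem_change and union sem_arg are hypothetical, and it performs no timing of its own.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Argument type for semctl(); declared locally under our own name. */
union sem_arg { int val; struct semid_ds *buf; unsigned short *array; };

/* Add `delta` to semaphore 0 of the set: -1 grabs it, +1 releases it. */
static int sem_change(int semid, short delta)
{
    struct sembuf op = { .sem_num = 0, .sem_op = delta, .sem_flg = 0 };
    return semop(semid, &op, 1);
}

/* Loop n times: grab the semaphore, add the index to *total, release. */
static int worker(int semid, long *total, long n)
{
    for (long i = 0; i < n; i++) {
        if (sem_change(semid, -1) != 0) return -1;
        *total += i;
        if (sem_change(semid, +1) != 0) return -1;
    }
    return 0;
}

/* Fork into two processes that each run worker(); returns the final
 * total, or -1 on error. */
long locked_sum(long n)
{
    long *total = mmap(NULL, sizeof *total, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (total == MAP_FAILED)
        return -1;
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0) {
        munmap(total, sizeof *total);
        return -1;
    }
    union sem_arg arg = { .val = 1 };          /* semaphore starts free */
    long result = -1;
    *total = 0;

    if (semctl(semid, 0, SETVAL, arg) == 0) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(worker(semid, total, n) == 0 ? 0 : 1);
        if (pid > 0) {
            int status, mine = worker(semid, total, n);
            if (waitpid(pid, &status, 0) == pid && mine == 0 &&
                WIFEXITED(status) && WEXITSTATUS(status) == 0)
                result = *total;
        }
    }
    semctl(semid, 0, IPC_RMID);                /* remove the semaphore set */
    munmap(total, sizeof *total);
    return result;
}
```

Each iteration costs two semaphore operations, so two processes looping 1,000,000 times perform the 4,000,000 operations mentioned above. Wrapping a call such as locked_sum(1000000) in a timer gives a rough per-statement cost for a hook() that enforces atomicity this way.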

4 Formal Definitions 

Here we define terminology used in the algorithms described in the next sections. 

• Assume there exist functions hook(), which is called before execution of
each application thread program statement, and done(), which is called
after the last statement in each thread. Application threads may include
most usual language features, such as branching, looping and function
calls (see section 3).

• We define the execution counter for a thread to be the number of times
hook() has been called since the beginning of the thread's execution. Intuitively,
this is the number of statements executed in that thread, plus
one.

• If $s_0$ is the value of the execution counter for thread 0 and $s_1$ is the
corresponding counter for thread 1, the pair $(s_0, s_1)$ forms the combined
execution counter for the two threads.

• An execution trace is defined to be a string $t \in \{0, 1\}^*$ which represents
the order in which statements from the two threads were executed to
reach a particular combined execution counter. In our earlier example,
the execution trace 0011 corresponds to the output "ab12".

• Let $V_0$ and $V_1$ represent the vectors of shared variable values for threads
0 and 1 having an execution trace $t$ and combined execution counter $C$.
The tuple $I = (V_0, V_1, t, C)$ is a partial interleaving for the two threads
(partial, since the threads may not yet have run to completion).
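To make the notion of an execution trace concrete, the sketch below replays a trace against the two-string example, where each statement of a thread appends one character to the output. The function name replay_trace and its error conventions are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replay an execution trace over two threads whose statements each emit
 * one character: '0' runs the next statement of thread 0 (string a),
 * '1' the next statement of thread 1 (string b).  Writes the resulting
 * output to `out` and returns 0, or returns -1 if the trace is invalid
 * or the buffer is too small. */
int replay_trace(const char *trace, const char *a, const char *b,
                 char *out, size_t outsize)
{
    size_t i = 0, j = 0, len = 0;

    if (outsize == 0)
        return -1;
    for (; *trace != '\0'; trace++) {
        char c;
        if (*trace == '0' && a[i] != '\0')
            c = a[i++];
        else if (*trace == '1' && b[j] != '\0')
            c = b[j++];
        else
            return -1;      /* bad symbol, or that thread already finished */
        if (len + 1 >= outsize)
            return -1;      /* no room for the character and terminator */
        out[len++] = c;
    }
    out[len] = '\0';
    return 0;
}
```

Replaying the trace 0011 over "ab" and "12" yields "ab12", as in the definition above, while 0101 yields "a1b2". A trace such as 000 is rejected because thread 0 has only two statements.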

5 Search Algorithm 

Our algorithm for performing a depth first search on all possible execution
interleavings of two threads t0 and t1 is as follows:

• Base case: Let I be the initial partial interleaving for threads t0, t1,
representing the program state at the first call to hook() in each thread,
before any application statements have executed.

• Recursion: Given a partial interleaving I for threads t0, t1,

  - Run any user-supplied code for checking conditions.

  - If done() has been called in both threads, terminate the current
    thread and indicate successful program execution for a single complete
    interleaving.

  - If done() has been called in only one thread, allow the other thread
    to continue to completion.

  - Otherwise, fork both threads to create children t0', t1'. Parents both
    wait for termination of the children. t0' returns from hook(), allowing
    a single statement to execute while t1' blocks. Then the recursive step
    is performed again on t0', t1' with a new partial interleaving I'. When
    t0', t1' have terminated, t1 returns, executing a single statement, and
    then the recursive step is performed again for t0, t1 with new partial
    interleaving I''.
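The recursion above can be approximated in a much simplified form that keeps the fork-and-wait structure but drops the buddy pairs. In this sketch, which is an illustration and not our implementation, a single process stands for both threads at each search node, each thread is a string whose statements emit one character, and no semaphores are needed because only one process runs at a time. The names search, explore_interleavings, struct state and struct results are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_RESULTS 32
#define MAX_LEN 16

/* Completed interleavings; kept in shared memory so every process can append. */
struct results { int count; char out[MAX_RESULTS][MAX_LEN]; };

/* Per-process state.  fork() copies it, so a parent resumes from the
 * branching point untouched by anything its child did. */
struct state { const char *stmt[2]; int pc[2]; char out[MAX_LEN]; int len; };

static int finished(const struct state *s, int k)
{
    return s->stmt[k][s->pc[k]] == '\0';
}

static void step(struct state *s, int k)     /* thread k runs one statement */
{
    s->out[s->len++] = s->stmt[k][s->pc[k]++];
}

static void search(struct state *s, struct results *r)
{
    while (!finished(s, 0) && !finished(s, 1)) {
        pid_t pid = fork();
        if (pid < 0)
            _exit(1);
        if (pid == 0) {
            step(s, 0);                      /* child: thread 0 goes next */
        } else {
            int status;                      /* parent: wait, then thread 1 */
            if (waitpid(pid, &status, 0) != pid ||
                !WIFEXITED(status) || WEXITSTATUS(status) != 0)
                _exit(1);
            step(s, 1);
        }
    }
    for (int k = 0; k < 2; k++)              /* one thread left: finish it */
        while (!finished(s, k))
            step(s, k);
    s->out[s->len] = '\0';
    if (r->count < MAX_RESULTS)
        strcpy(r->out[r->count++], s->out);
}

/* Explore all interleavings of a and b; copies the results to *copy and
 * returns their number, or -1 on error. */
int explore_interleavings(const char *a, const char *b, struct results *copy)
{
    if (strlen(a) + strlen(b) >= MAX_LEN)
        return -1;
    struct results *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (r == MAP_FAILED)
        return -1;
    r->count = 0;

    int n = -1, status;
    pid_t root = fork();
    if (root == 0) {
        struct state s = { { a, b }, { 0, 0 }, "", 0 };
        search(&s, r);
        _exit(0);    /* every process forked inside search() ends here too */
    }
    if (root > 0 && waitpid(root, &status, 0) == root &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        *copy = *r;
        n = copy->count;
    }
    munmap(r, sizeof *r);
    return n;
}
```

For the threads "ab" and "12" this visits the six interleavings in the same order as the walkthrough in section 2.1: "ab12", "a1b2", "a12b", "1ab2", "1a2b", "12ab". User-supplied condition checks would go at the top of the loop in search().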

Omitting #include statements and helper functions for setting up semaphores
and shared memory, our C implementation of this algorithm fits in one printed 
page of code (80 columns by 65 lines). 

6 Detecting Race Conditions and Pruning the 
Search Space 

Note that in the above example, program output is entirely dependent on the
order in which the OS schedules the two threads for execution. Such nondeterministic
behavior is almost never intended, and usually represents a bug in the
code.

Consequently, we provide a technique for verifying that threads behave the
same regardless of the order in which they are executed. This technique also
makes it easy to avoid unnecessary exploration of the space of possible interleavings.
Formally, we define a race condition as follows:

• Let $I = (V_0, V_1, t, C)$ and $I' = (V_0', V_1', t', C')$ represent partial interleavings
such that $C = C'$. That is, $I$ and $I'$ are partial interleavings which
have reached the same execution counter in each thread but potentially
through a different order of execution. $I$ and $I'$ form a race condition
iff $I \neq I'$.

To implement this technique, we maintain a shared table keyed on the com- 
bined execution counters explored while searching the state space. Each ta- 
ble entry records the partial interleaving at that combined execution counter. 
When a particular combined execution counter is reached via a different execu- 
tion trace, the current partial interleaving is compared against the stored partial 
interleaving. If they differ, the two partial interleavings are displayed as exam- 
ples of execution paths capable of producing differing behaviors. A program 
with no race conditions will of course display only the single possible outcome 
of program execution. 

The second purpose of this table is to record which combined execution 
counters have been reached before. Since our algorithm performs a depth first 
search on the possible thread interleavings, the second occurrence of a combined
execution counter can only occur once all the remaining interleavings from that 
point on have already been explored. Consequently, if a race condition does not 
occur at a particular explored partial interleaving, there is no need to explore it 
again since the two threads are in exactly the same state as they were the last 
time. 
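The two uses of the table, race detection and pruning, can be sketched with a small in-memory version. This is an illustration under assumptions: the variable state is reduced to a short string, the table is an ordinary array rather than a shared segment, and the names visit, struct entry and the VISIT_ constants are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_STEPS 8     /* illustrative bound on each execution counter */
#define STATE_LEN 16    /* illustrative size of the recorded state */

enum visit_result { VISIT_NEW, VISIT_PRUNE, VISIT_RACE };

struct entry {
    int seen;
    char state[STATE_LEN];            /* variable values on first arrival */
    char trace[2 * MAX_STEPS + 1];    /* execution trace that got here first */
};

/* Table keyed on the combined execution counter (s0, s1). */
struct entry table[MAX_STEPS + 1][MAX_STEPS + 1];

/* Record or compare the state reached at counter (s0, s1) via `trace`. */
enum visit_result visit(int s0, int s1, const char *state, const char *trace)
{
    assert(s0 >= 0 && s0 <= MAX_STEPS && s1 >= 0 && s1 <= MAX_STEPS);
    assert(strlen(state) < STATE_LEN && strlen(trace) <= 2 * MAX_STEPS);

    struct entry *e = &table[s0][s1];
    if (!e->seen) {
        e->seen = 1;
        strcpy(e->state, state);
        strcpy(e->trace, trace);
        return VISIT_NEW;        /* first arrival: keep exploring */
    }
    if (strcmp(e->state, state) == 0)
        return VISIT_PRUNE;      /* same state as before: nothing new below */
    return VISIT_RACE;           /* same counter, different state */
}
```

Two threads that write different variables reach the same state at a given counter regardless of order, so the second arrival is pruned. In the string example the outputs "ab12" and "a1b2" differ at the same counter, so the second arrival is reported as a race, and the stored trace identifies the first of the two conflicting paths.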

Although we have not yet been able to derive a formula for the complexity of
our pruning algorithm, the measurements below show it exploring more than an
order of magnitude fewer states than an exhaustive search once each thread
executes 6 or more statements. We ran our algorithm on pairs of threads each
executing 3 to 8 statements. Values represent the total number of calls to
hook(), which roughly corresponds to the number of states explored.

Technique (statements per thread)     3     4     5      6      7       8

Exhaustive                           30   112   420   1584   6006   22876
Pruning                              18    32    50     72     98     128

This addition was surprisingly easy to make to our system; it required less
than a page of code in changes.

7 Supporting IPC 

Our implementation supports multiple semaphores which may be used by the
user application for interprocess communication. This complicates our system,
since a thread may block until the other thread releases a particular semaphore,
and complicates the actual implementation even more, due to practical issues
regarding process cleanup, IPC and resource management. Here we give the
algorithm for supporting an arbitrary number of application semaphores.

• Let the definition of a partial interleaving be extended to include a set of
semaphores $S = \{s_0, \ldots, s_n\}$, each of which may be up or down.

• Let the function down(i) be a valid application statement (to be preceded
by a call to hook()). down(i) causes the current thread to lower $s_i$ if it is
up, and do nothing otherwise.

• Let the function up(i) perform the complementary operation, with the
addition that if $s_i$ is already up, the thread blocks until the other thread
calls down(i). If the other thread is already blocking, report deadlock and
terminate both threads.

• Base case: Let I be the initial partial interleaving for threads t0, t1,
representing the program state at the first call to hook() in each thread,
before any application statements have executed.

• Recursion: Given a partial interleaving I for threads t0, t1,

  - Run any user-supplied code for checking conditions.

  - If done() has been called in both threads, terminate the current
    thread and indicate successful program execution for a single complete
    interleaving.

  - If done() has been called in only one thread, allow the other thread
    to continue to completion. If the other thread is blocked, report an
    error, since the thread will block forever.

  - If a thread is blocked, allow the other thread to execute another
    statement.

  - Otherwise, fork both threads to create children t0', t1'. Parents both
    wait for termination of the children. t0' returns from hook(), allowing
    a single statement to execute while t1' blocks. Then the recursive step
    is performed again on t0', t1' with a new partial interleaving I'. When
    t0', t1' have terminated, t1 returns, executing a single statement, and
    then the recursive step is performed again for t0, t1 with new partial
    interleaving I''.
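The semantics of up(i) and down(i) given above can be modelled without any processes at all. The sketch below is such a model, written under several assumptions that the definitions leave open: semaphores start in the down state, a thread released from a blocked up(i) completes its raise immediately, and the names struct sems, sem_up, sem_down and the bound NSEMS are hypothetical.

```c
#include <assert.h>

#define NSEMS 4    /* illustrative number of application semaphores */

enum outcome { RAN, BLOCKED, DEADLOCK };

struct sems {
    int up[NSEMS];        /* 1 if semaphore i is up, 0 if it is down */
    int waiting_on[2];    /* semaphore thread k is blocked on, or -1 */
};

/* Assumed initial state: every semaphore down, neither thread blocked. */
void sems_init(struct sems *s)
{
    for (int i = 0; i < NSEMS; i++)
        s->up[i] = 0;
    s->waiting_on[0] = s->waiting_on[1] = -1;
}

/* down(i) by thread k: lower s_i if it is up, otherwise do nothing.  If
 * the other thread was blocked in up(i), it is released and (by assumption)
 * its pending raise takes effect. */
enum outcome sem_down(struct sems *s, int k, int i)
{
    int other = 1 - k;

    assert((k == 0 || k == 1) && i >= 0 && i < NSEMS);
    s->up[i] = 0;
    if (s->waiting_on[other] == i) {
        s->waiting_on[other] = -1;
        s->up[i] = 1;
    }
    return RAN;
}

/* up(i) by thread k: raise s_i.  If it is already up, the thread blocks,
 * unless the other thread is blocked too, which is a deadlock. */
enum outcome sem_up(struct sems *s, int k, int i)
{
    int other = 1 - k;

    assert((k == 0 || k == 1) && i >= 0 && i < NSEMS);
    if (!s->up[i]) {
        s->up[i] = 1;
        return RAN;
    }
    if (s->waiting_on[other] != -1)
        return DEADLOCK;
    s->waiting_on[k] = i;
    return BLOCKED;
}
```

In this model, thread 0 calling up(0) twice leaves it blocked; if thread 1 then calls up(0) as well, both threads would be blocked and a deadlock is reported, whereas a down(0) from thread 1 releases thread 0. The search would consult waiting_on to decide which of the recursion's cases applies.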

8 Limitations 

In most thread implementations, threads share all global variables, but each has
its own stack. Local variables are thus maintained independently from other
threads. Our implementation presently provides no support for such variables,
and assumes that all thread state can be monitored via the shared variables and
execution counter. It would be easy to create a second structure associated with
each buddy process for storing variables unique to each thread, and account for
that additional state information when checking for race conditions and pruning
the state space.

As we described in section 3, our assumption that program statements are
performed atomically is not at all guaranteed by real computers, unless our
technique is employed at a machine code level. To achieve reported results in
practice, statement-level atomicity would need to be enforced by the operating
system, language interpreter, or by using a modified hook() as we described.

While using fork() to store program execution state makes our system very 
simple to implement, it imposes a significant amount of system overhead. The 
system resources for two processes, including two process table entries, are re- 
quired for each level of depth in the search space, which corresponds to the 
number of statements executed by the combined threads. This makes even sim- 
ple programs, like a pair of threads which each loops 1,000,000 times, impossible 
to verify with our system. 

Finally, our current system is limited to programs with two threads. See
section 10 for discussion on removing this limitation.

9 Implementation and Performance 

As we showed in section 6, our pruning algorithm performs far better than an
exhaustive search. Default (though modifiable) OS limitations on the number
of available semaphore sets and our implementation's inefficient use of single
semaphore sets rather than multiple semaphores in multiple sets limit us to
running applications which execute a total of about 22 steps. These it handles
in under 0.1 seconds. The resulting maximum of 44 live processes imposes no
noticeable memory consumption.

While our fundamental search algorithm can be implemented in about a page 
of code, our full system supporting application semaphores, race detection and 
pruning, and with helper functions, debugging code, and whitespace currently 
weighs in at 611 lines of C. The implementation requires 4 semaphores for each
level of depth in the DFS, and requires $n^3$ storage in the DFS depth $n$ to maintain
the table of partial interleavings complete with program execution paths. For
our current depth limit of 22, this amounts to about 100k of memory.

10 Future Work 

Our system is limited to programs with two threads. Since our system is modeled
after the traditional recursive depth first search algorithm, there is a clear path
for extending our algorithm to support any number of threads. Rather than
pairs of parent processes spawning a single pair of children, each parent in a
set of n parent buddies would iteratively spawn n-1 children (waiting each time
for the previous child to die). The n n-member buddy sets (cliques?) would
then be dispatched, exploring the paths in which each child executes its next
statement. Instead of a maximum 2k live processes for a k-level DFS, up to nk
processes would exist.

Implementation, in our experience, might be time consuming, due to the 
inherent difficulty humans seem to have keeping track of multiprocess systems. 
However, with careful planning, and for programmers more experienced with 
multiprocess applications, this extension should not prove too difficult. 

One feature that might be quite easy to add is the ability for parents to
run without waiting for their children. Unchecked, this would act like a "fork
bomb", potentially swamping the system as the entire search tree unfolded at
once. But with a limit on how many pairs of processes could run at once,
our system would immediately be able to run on SMP systems with arbitrary
numbers of processors, limited only by the size of the process table and the
system memory.

At the other extreme, with only two processes running at once, it's unnec- 
essary to allocate each pair of buddies their own set of semaphores, as our first 
implementation does. With more careful use and management, we suspect that 
a single set would suffice, avoiding system limits on available semaphores. The 
state table also need not require so much storage; cryptographic hash functions
can be used to reduce arbitrary amounts of data to a 128-bit digest which can
be stored in place of the actual partial interleaving. An expected $2^{64}$ entries
would have to be made before any pair of entries would have the same digest,
which is far more than any PC can store today.

As we described in section 3, our system should be easy to implement in
a variety of languages, possibly just by using cross-language extensions to call
our original C implementation. There are also other uses for hook() which we
described but have not implemented.

11 Conclusion 

We gave a straightforward technique for checking thread-related properties of 
programs with two threads. Our technique is general-purpose and language- 
independent, and may be particularly suited to modern high-level scripting 
languages due to the difficulty of machine-code level model checking for these 
languages and the overhead required to enforce the statement-level atomicity 
required by our system. 

Our system performs much better than an exhaustive search of the state
space, understands application semaphore use, and detects deadlock and race
conditions. Furthermore, our implementation is quite compact; our exhaustive
search algorithm fits in one printed page of C code, and our complete implementation
fits in under ten.